TensorFlow Boosted Trees Classifier

In this article, we demonstrate solving a classification problem in TensorFlow with Estimators, using the Default dataset. This dataset can be exported from the ISLR package in R using the following syntax.

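# Load the ISLR package, which ships the Default data frame
library(ISLR)
# Export the data frame to a CSV file for use outside R
write.csv(Default, "Default.csv")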

Dataset Information:

The Default dataset is a simulated data set containing information on ten thousand customers. The aim is to predict which customers will default on their credit card debt, using whether the customer is a student, the customer's average credit card balance, and the customer's income as predictors.

| Attribute | Description |
|-----------|-------------|
| Default | A factor with levels No and Yes indicating whether the customer defaulted on their debt |
| Student | A factor with levels No and Yes indicating whether the customer is a student |
| Balance | The average balance that the customer has remaining on their credit card after making their monthly payment |
| Income | Income of customer |

Data Correlations

First off, the outcome variable here is Default, a factor with the two levels No and Yes.

Moreover, there is another categorical variable, Student.
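
Since both Default and Student are stored as Yes/No strings, it helps to turn the outcome into a numeric label and to check the levels of the categorical feature. The following is a minimal sketch, assuming pandas is available; the toy rows stand in for Default.csv and are illustrative values only, and the variable names are hypothetical.

```python
import pandas as pd

# Toy rows standing in for Default.csv (illustrative values only).
df = pd.DataFrame({
    "Default": ["No", "Yes", "No"],
    "Student": ["No", "Yes", "Yes"],
    "Balance": [729.5, 1861.6, 1073.5],
    "Income": [44361.6, 14891.0, 31767.1],
})

# Encode the outcome: No -> 0, Yes -> 1.
labels = df["Default"].map({"No": 0, "Yes": 1}).astype("int64")

# Count how many rows fall into each level of the categorical feature.
student_counts = df["Student"].value_counts()
```

Student is left as a string column here so that it can later be described to the model as a categorical feature.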

Now, let's take a look at the variance of the features.

Furthermore, we would like to standardize features by removing the mean and scaling to unit variance.
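
Standardization of this kind can be sketched as follows. This is a minimal illustration, assuming NumPy; the helper name `standardize` and the sample values are hypothetical. It uses the population standard deviation (`ddof=0`) and guards against constant columns.

```python
import numpy as np

def standardize(X):
    """Return column-wise standardized X along with the mean and std used."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # leave constant columns unscaled
    return (X - mean) / std, mean, std

# Illustrative Balance and Income values.
X = np.array([
    [729.5, 44361.6],
    [817.2, 12106.1],
    [1073.5, 31767.1],
    [529.3, 35704.5],
])
Z, mean, std = standardize(X)
```

When a separate test set is used, the mean and standard deviation would normally be computed on the training set only and then reused to transform the test set.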

Train and Test sets

Input Function

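The input function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. More precisely, an input function is a function that returns a tf.data.Dataset object yielding two-element tuples (features, labels), where features is a dictionary mapping each feature name to a tensor of values and labels is a tensor of the corresponding label values.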
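
An input function along these lines might look like the sketch below. It assumes TensorFlow 2.x with eager execution; the factory name `make_input_fn`, its parameters, and the toy data are illustrative choices, not taken from the original article.

```python
import numpy as np
import tensorflow as tf

def make_input_fn(features, labels, n_epochs=None, shuffle=True, batch_size=32):
    """Build an input function yielding (features, labels) batches."""
    def input_fn():
        # Each element is a (dict of feature tensors, label tensor) pair.
        ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(labels))
        # n_epochs=None repeats the data indefinitely (typical for training).
        return ds.repeat(n_epochs).batch(batch_size)
    return input_fn

# Toy data (illustrative values only).
features = {
    "Balance": np.array([729.5, 817.2, 1073.5, 529.3], dtype=np.float32),
    "Student": np.array(["No", "Yes", "No", "No"]),
}
labels = np.array([0, 0, 1, 0], dtype=np.int64)

# One pass over the data in its original order, as one would use for evaluation.
eval_input_fn = make_input_fn(features, labels, n_epochs=1, shuffle=False, batch_size=2)
```

Returning a closure rather than the dataset itself lets the caller pass the function around and build the dataset only when it is actually needed.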

An estimator model takes two main inputs: feature columns and a numeric input vector. Feature columns describe how each entry of the input should be interpreted. The following function separates the categorical and numerical columns (features) and returns a descriptive list of feature columns.
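
A function of that kind could look like the sketch below. It assumes pandas and a TensorFlow version in which the `tf.feature_column` module is still available (it is deprecated in recent releases); the function name `make_feature_columns` and the toy data are hypothetical. Categorical columns are one-hot encoded through an indicator column, and every other column is treated as numeric.

```python
import pandas as pd
import tensorflow as tf

def make_feature_columns(df):
    """Return a list of feature columns describing the columns of df."""
    feature_columns = []
    for name in df.columns:
        if pd.api.types.is_numeric_dtype(df[name]):
            # Numeric features are passed through as floats.
            feature_columns.append(
                tf.feature_column.numeric_column(name, dtype=tf.float32))
        else:
            # Categorical features: build a vocabulary, then one-hot encode.
            vocabulary = sorted(str(v) for v in df[name].unique())
            categorical = tf.feature_column.categorical_column_with_vocabulary_list(
                name, vocabulary)
            feature_columns.append(tf.feature_column.indicator_column(categorical))
    return feature_columns

# Toy feature table (illustrative values only).
X = pd.DataFrame({
    "Student": ["No", "Yes", "No"],
    "Balance": [729.5, 817.2, 1073.5],
    "Income": [44361.6, 12106.1, 31767.1],
})
feature_columns = make_feature_columns(X)
```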

Boosted Trees Classifier

Predictions

ROC Curves

Confusion Matrix

Boosted Trees Classifier with $l_1$ regularization (Lasso)

The lasso (least absolute shrinkage and selection operator) was introduced in the context of the method of least squares. It alters the model-fitting process so that only a subset of the provided covariates is used in the final model instead of all of them, which can improve the prediction accuracy and interpretability of regression models.
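
For reference, in its original least-squares setting the lasso estimate can be written as below. The symbols $y_i$, $x_{ij}$, $\beta_j$, and the tuning parameter $\lambda \ge 0$ are notation introduced here for illustration; they are not taken from the article.

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
$$

Because the penalty is the $l_1$ norm of the coefficients, sufficiently large values of $\lambda$ drive some $\beta_j$ exactly to zero, which is what produces the subset selection.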

Predictions

ROC Curves

Confusion Matrix

Boosted Trees Classifier with $l_2$ regularization (Ridge)

Predictions

ROC Curves

Confusion Matrix


References

  1. Regression analysis Wikipedia page
  2. TensorFlow tutorials
  3. TensorFlow Boosted Trees Classifier
  4. Lasso (statistics)
  5. Tikhonov regularization
  6. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 3-7). New York: Springer.
  7. Jordi Warmenhoven, ISLR-python
  8. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R